Abstract

Scale-space theory has been established primarily by the computer
vision and signal processing communities as a well-founded and
promising framework for multi-scale processing of signals (e.g.,
images). By embedding an original signal into a family of gradually
coarsened signals parameterized by a continuous scale parameter, it
provides a formal framework to capture the structure of a signal at
different scales in a consistent way. In this paper, we present a
scale-space theory for text by integrating semantic and spatial
filters, and demonstrate how natural language documents can be
understood, processed and analyzed at multiple resolutions, and how
this scale-space representation can be used to facilitate a variety
of NLP and text analysis tasks.



1 Introduction 

Physical objects in the world appear differently depending on the
scale of observation/measurement. Take a tree as an example:
meaningful observations range from molecules at the scale of
nanometers, to leaves at centimeters, to branches at meters, and to
forests at kilometers. This inherent property is ubiquitous and holds
equally true for natural language. On the one hand, concepts are
meaningful only at the right resolution. For instance, named entities
range from unigrams (e.g., "new") to bigrams (e.g., "New York"), to
multigrams (e.g., "New York Times"), and even to whole long sequences
(e.g., the song name "Another Lonely Night In New York"). On the other
hand, our understanding of natural language depends critically on the
scale at which it is examined. For example, depending on how much
detail we want from a document, our knowledge could range from a
collection of "keywords", to a sentence sketch named "title", to a
paragraph summary named "abstract", to a page-long "introduction" and
finally to the entire content. The notion of scale is fundamental to
the understanding of natural language, yet it has been largely ignored
by existing models for text representation, which include the simple
bag-of-words (BOW) or unigram language model (LM), n-gram or
higher-order LMs, and other more advanced text/language models
(Iyer and Ostendorf, 1996; Manning and Schuetze, 1999; Metzler and
Croft, 2005). One key problem with many of these models is their
inflexibility: they capture the semantic structure rather rigidly at
only a single resolution (e.g., n-gram with a single fixed value of
n). However, which scale is appropriate for a specific task is usually
unknown a priori and in many cases not even homogeneous (e.g., a
document may contain named entities of different lengths), making it
impossible to capture the right meanings with a single fixed scale.

Scale-space theory is a well-established and promising framework for
multi-resolution representation, developed primarily by the computer
vision and signal processing communities with complementary
motivations from physics and biological vision. The key idea is to
embed a signal into the scale space, i.e., to represent it as a family
of progressively smoothed signals parameterized by a continuous scale
variable, where fine-resolution detailed structures are progressively
suppressed by the convolution of the original signal with a smoothing
kernel (i.e., a low-pass filter with certain properties)
(Witkin, 1983; Lindeberg, 1994).

In this paper, we adapt the scale-space model from image to text
signals, proposing a novel framework that enables multi-resolution
representation for documents. The adaptation poses substantial
challenges, as the structure of the semantic domain is considerably
more complicated than that of the spatial domains in traditional image
scale space. We show how this can be made possible with a set of
assumptions and simplifications. The scale-space model for text not
only provides new perspectives on how text analysis tasks can be
formulated and addressed, but also enables well-established computer
vision tools to be adapted and applied to text processing, e.g.,
matching, segmentation, description, interest point detection, and
classification. To stimulate further investigation in this promising
direction, we present several instantiations to demonstrate how this
model can be used in a variety of NLP and text analysis tasks to make
things easier, better, and most importantly, scale-invariant.

2 Scale Space Representation 

The notion of scale space is applicable to signals of arbitrary
dimensions. Let us consider the most common case, where it is applied
to 2-dimensional signals such as images. Given an image $f(x_1, x_2)$,
its scale-space representation $\gamma(x_1, x_2, s)$ is defined by:

$$\gamma(x_1, x_2, s) = f(x_1, x_2) * \ell(x_1, x_2, s)
  = \iint f(x_1 - u_1, x_2 - u_2)\, \ell(u_1, u_2, s)\, du_1\, du_2, \qquad (1)$$

where $*$ denotes the convolution operator, and
$\ell : \mathbb{R}^2 \times \mathbb{R}_+ \to \mathbb{R}$ is a smoothing
kernel (i.e., a low-pass filter) with a set of desired properties
(i.e., the scale-space axioms (Lindeberg, 1994)). The bandwidth
parameter $s$ is referred to as the scale parameter since, as $s$
increases, the derived image becomes gradually smoother (i.e.,
blurred) and consequently more and more fine-scale structures are
suppressed.

It has been shown that the Gaussian kernel is the unique option that
satisfies the conditions for linear scale space:

$$\ell(x_1, x_2, s) = \frac{1}{2\pi s} \exp\left(-\frac{x_1^2 + x_2^2}{2s}\right). \qquad (2)$$



The resultant linear scale-space representation $\gamma(x_1, x_2, s)$
can be obtained equivalently as a solution to the diffusion (heat)
equation

$$\partial_s \gamma = \tfrac{1}{2} \Delta \gamma \qquad (3)$$

with initial condition $\gamma(x, 0) = f(x)$, where $\Delta$ denotes
the Laplace operator, which in a 2-dimensional spatial space
corresponds to
$\Delta = \partial^2 / \partial x_1^2 + \partial^2 / \partial x_2^2$.
If we view $\gamma$ as a heat distribution, the equation essentially
describes how it diffuses from the initial value, $f$, in a
homogeneous medium with uniform conductivity over time $s$. As we can
imagine, the distribution will gradually approach uniformity and
consequently the fine-scale structure of $f$ will be lost.
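The loss of fine-scale structure can be illustrated with a small sketch. It is not the construction used in the paper: it assumes a 1D toy signal, a sampled (truncated and renormalized) Gaussian with variance `s`, and edge replication at the borders; the helper names are hypothetical.

```python
import math

def gaussian_kernel(s):
    """Sampled 1D Gaussian with variance s, truncated at 4 sigma and renormalized."""
    if s <= 0:
        return [1.0]  # scale 0 is the identity
    radius = max(1, int(math.ceil(4 * math.sqrt(s))))
    w = [math.exp(-(u * u) / (2.0 * s)) for u in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]

def smooth(signal, s):
    """Convolve `signal` with the kernel, replicating the edge values."""
    k = gaussian_kernel(s)
    r = len(k) // 2
    n = len(signal)
    out = []
    for x in range(n):
        acc = 0.0
        for j, w in enumerate(k):
            idx = min(max(x + j - r, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def count_local_maxima(signal):
    """Number of strict interior local maxima."""
    return sum(1 for i in range(1, len(signal) - 1)
               if signal[i - 1] < signal[i] > signal[i + 1])

f = [0, 1, 0, 1, 0, 0, 0, 5, 0, 0, 1, 0, 1, 0]
scale_space = {s: smooth(f, s) for s in (0, 1, 4, 16)}
maxima = {s: count_local_maxima(g) for s, g in scale_space.items()}
```

Here `maxima` records how many local maxima survive at each scale; the small bumps around the dominant peak disappear as `s` grows, which is the behaviour the heat-diffusion view predicts.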

Scale-space theory provides a formal framework for handling the
multi-scale nature of both the physical world and human perception.
Since its introduction in the 1980s, it has become the foundation of
many computer vision techniques and has been widely applied to a large
variety of vision/image processing tasks. In this paper, we show how
this powerful tool can be adapted and applied to natural language
texts.

3 Scale Space Model for Text 

3.1 Word-level 2D Image Analogy of Text 

A straightforward step towards a textual scale space would be to
represent texts in the same way as image signals. In this section, we
show how this can be made possible. Alternative signal formulations
are discussed in the following section.

Let $V = \{v_1, v_2, \ldots, v_M\}$ be our vocabulary consisting of
$M$ words. Given a document $d$ comprising a finite $N$-word sequence
$d = w_1 w_2 \cdots w_N$, we can, without any information loss,
characterize $d$ as a 2D $N \times M$ binary matrix $f$, whose
$(x, y)$-th entry $f(x, y)$ indicates whether or not the $y$-th
vocabulary word $v_y$ is observed at the $x$-th position, i.e.,
$f(x, y) = \delta(w_x, v_y)$, where $\delta(a, b) = 1$ if $a = b$ and
$0$ otherwise. Hereafter, we refer to the $x$-axis as the spatial
domain (i.e., positions in the document,
$x \in \mathcal{X} = \{1, \ldots, N\}$), and to the $y$-axis as the
semantic domain (i.e., indices in the vocabulary,
$y \in \mathcal{Y} = V$). This representation provides an image
analogy to text, i.e., a document $f$ is equivalent to a
black-and-white image, except that here we have one spatial and one
semantic domain, $(x, y)$, instead of two spatial domains,
$(x_1, x_2)$.
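As a minimal sketch of this construction, the function below builds the binary matrix for a tokenized document. The token list is assumed to be already lower-cased, stemmed and stripped of stop words, and the handling of out-of-vocabulary tokens (an all-zero row) is an assumption of this sketch, not something the text specifies.

```python
def word_level_signal(tokens, vocab):
    """Return the N x M binary matrix f with f[x][y] = 1 iff tokens[x] == vocab[y]."""
    index = {v: y for y, v in enumerate(vocab)}
    f = [[0] * len(vocab) for _ in tokens]
    for x, w in enumerate(tokens):
        y = index.get(w)
        if y is not None:  # assumption: out-of-vocabulary tokens give an all-zero row
            f[x][y] = 1
    return f

vocab = ["new", "york", "time", "free", "iphone", "gift", "customer",
         "apple", "egg", "city", "service", "coupon"]
# Hypothetical preprocessed version of the example sentence used later in the paper.
tokens = ["new", "york", "time", "free", "iphone", "gift",
          "new", "customer", "new", "york"]
f = word_level_signal(tokens, vocab)
```

Each row of `f` is a position in the document and each column a vocabulary word, so every row contains at most a single 1.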

Interestingly, the scale-space representation can also be motivated by
this binary model from a slightly different perspective, as a way of
robust density estimation. We have the following definition:

Definition 1. A 2D text model $f \in \mathbb{R}^{N \times M}$ is a
probabilistic distribution over the joint spatial-semantic space
$\mathcal{X} \times \mathcal{Y}$, with $0 \le f(x, y) \le 1$ and
$\int_{\mathcal{X}} \int_{\mathcal{Y}} f(x, y)\, dx\, dy = 1$.

This 2D text model defines the probability of observing a semantic
word $y$ at a spatial position $x$. The binary matrix representation
(after normalization) can be understood as an estimate of $f$ with
kernel density estimators:

where $e_i$ is the $i$-th column vector of an identity matrix,
$f(x, \cdot)$ denotes the $x$-th row vector and $f(\cdot, y)$ the
$y$-th column vector. Note that here the Dirac impulse kernel $\delta$
is used, i.e., words are treated as unrelated both spatially and
semantically. This contradicts common knowledge, since neighboring
words in text are highly correlated both semantically (Mei et al.,
2008) and spatially (Lebanon et al., 2007). For instance, observing
the word "New" indicates a high likelihood of seeing the word "York"
at the next position. This motivates a more reliable estimate of $f$
using smooth kernels such as the Gaussian (Witkin, 1983; Lindeberg,
1994), which, as we will see, leads exactly to the Gaussian filtering
used in linear scale-space theory.



3.2 Textual Signals

The 2D binary matrix described above is not the only option we can
work with in scale space. Generally speaking, any vector, matrix or
even tensor representation of a document can be used as a signal upon
which scale-space filtering can be applied. In particular, we use the
following in the current paper:

• Word-level 2D signal, $f(x, y)$, is the binary matrix we described
in §3.1. It records the spatial position of each word, and is defined
on the joint spatial-semantic domain.

• Bag-of-words 1D signal is the BOW representation
$f(y) = \sum_x f(x, y)$, i.e., the 2D matrix is collapsed to a 1D
vector. Since the spatial axis is wiped out, this signal is defined on
the semantic domain alone.

• Sentence-level 2D signal is a compromise between the word-level 2D
and the BOW signals. Instead of collapsing the spatial dimension for
the whole document, we do it for each sentence. As a result, this
signal, $f(x, y)$, records the position of each sentence; for a fixed
position $x_0$, $f(x = x_0, y)$ records the BOW of the corresponding
sentence.

• Topic 1D signal, $f(x)$, is composed of the topic embedding of each
sentence and is defined on the spatial domain only. Assuming we have
trained a topic model (e.g., Latent Dirichlet Allocation) on a
universal corpus in advance, this signal is obtained by applying topic
inference to each sentence and recording the topic embedding
$\theta_x \in \mathbb{R}^K$, where $K \ll M$ is the dimensionality of
the topic space. Topic embedding is beneficial since it gives us the
ability to handle synonymy and polysemy. Also note that the semantic
correlation may have been eliminated, so that semantic smoothing is no
longer necessary. In other words, although $f(x)$ is a matrix, we
would rather treat it as a vector-variate 1D signal.

All these textual signals involve either a semantic domain or both
semantic and spatial domains. In the following, we investigate how
scale-space filtering can be applied to these domains respectively.
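The bag-of-words and sentence-level signals can both be derived from the word-level binary matrix by summing rows. The sketch below assumes sentence boundaries are supplied as a list of sentence lengths (a hypothetical input format); the topic signal is not shown because it needs a trained topic model.

```python
def bow_signal(f):
    """Collapse the spatial axis of an N x M matrix: f(y) = sum_x f(x, y)."""
    return [sum(col) for col in zip(*f)]

def sentence_level_signal(f, sentence_lengths):
    """Collapse the spatial axis within each sentence; row i is the BOW of sentence i."""
    rows, start = [], 0
    for n in sentence_lengths:
        rows.append(bow_signal(f[start:start + n]))
        start += n
    return rows

# Toy word-level signal: 4 positions, 3 vocabulary words.
f = [[1, 0, 0],
     [0, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
bow = bow_signal(f)
sentences = sentence_level_signal(f, [2, 2])  # two sentences of two words each
```

The sentence-level rows always sum back to the document-level BOW vector, which is why the sentence-level signal sits between the two other representations.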



3.3 Spatial Filtering 

Spatial filtering has long been popular in signal processing
(Witkin, 1983; Lindeberg, 1994), and was recently explored in NLP
(Lebanon et al., 2007; Yang and Zha, 2010). It can be achieved by
convolution of the signal with a low-pass spatial filter, i.e.,
$\gamma(x, s) = f(x) * \ell(x, s)$. For texts, this amounts to
borrowing the occurrence of words at one position from its neighboring
positions, similar to what is done by a cache-based language model
(Jelinek et al., 1991; Beeferman et al., 1999).

In order not to introduce spurious information, the filter $\ell$
needs to satisfy a set of scale-space axioms (Lindeberg, 1994). If we
view the positions in a text as a spatial domain, the Gaussian kernel
$\ell(x, s) = \frac{1}{\sqrt{2\pi s}} \exp(-x^2 / 2s)$ or its discrete
counterpart

$$\ell(n, s) = e^{-s} I_n(s) \qquad (6)$$

are singled out as the unique options that satisfy the set of axioms¹
leading to the linear scale space, where $I_n$ denotes the modified
Bessel functions of integer order. Alternatively, if we view the
position $x$ as a time variable, as in Markov language models, a
Poisson kernel $\ell(n, s) = e^{-s} s^n / n!$ is more appropriate, as
it retains temporal causality (i.e., inaccessibility of future data).
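The difference between the symmetric Gaussian filter and the causal Poisson filter can be seen on a single word occurrence. This sketch substitutes a sampled, truncated and renormalized Gaussian for the Bessel-function kernel, truncates the Poisson kernel at `radius`, and treats positions outside the text as empty; all of these are simplifying assumptions.

```python
import math

def gaussian_weights(s, radius):
    """Sampled Gaussian on offsets -radius..radius, renormalized to sum to 1."""
    offsets = range(-radius, radius + 1)
    w = [math.exp(-(n * n) / (2.0 * s)) / math.sqrt(2.0 * math.pi * s) for n in offsets]
    z = sum(w)
    return {n: v / z for n, v in zip(offsets, w)}

def poisson_weights(s, radius):
    """Causal kernel e^{-s} s^n / n! on offsets 0..radius, truncated and renormalized."""
    w = [math.exp(-s) * s ** n / math.factorial(n) for n in range(radius + 1)]
    z = sum(w)
    return {n: v / z for n, v in enumerate(w)}

def spatial_filter(column, weights):
    """gamma(x) = sum_n weights[n] * column[x - n]; positions outside the text count as 0."""
    size = len(column)
    return [sum(w * column[x - n] for n, w in weights.items() if 0 <= x - n < size)
            for x in range(size)]

occurrences = [0, 0, 0, 1, 0, 0, 0]  # one word observed at position 3
g = spatial_filter(occurrences, gaussian_weights(1.0, 3))
p = spatial_filter(occurrences, poisson_weights(1.0, 3))
```

With the Gaussian weights the occurrence leaks equally to earlier and later positions, whereas with the Poisson weights positions before the occurrence stay at zero, since a position only borrows from itself and its past.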

3.4 Semantic Filtering 

Semantic filtering attempts to smooth the probabilities of seeing
words that are semantically correlated. In contrast to the spatial
domain, the semantic domain has some unique properties. The first
thing we notice is that, as semantic coordinates are nothing but
indices into the dictionary, we can permute them without changing the
semantic meaning of the representation. We refer to this property as
permutation invariance. Semantic smoothing has been extensively
explored in natural language processing (Manning and Schuetze, 1999;
Zhai and Lafferty, 2004). Classical smoothing methods, e.g., the
Laplace and Dirichlet smoothers, usually shrink the original
distributions towards a predefined reference distribution. Recent
advances explored local smoothing, where correlated words are smoothed
according to their interrelations defined by a semantic network
(Mei et al., 2008).

Given a semantic graph $\mathcal{G}$, where two correlated words $v_y$
and $v_z$ are connected with weight $\mu_{yz}$, semantic smoothing can
be formulated as solving a graph-based optimization problem:



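$$\min_{\gamma} \; (1 - \lambda) \sum_{y} \mu_y \big(\gamma(y) - f(y)\big)^2
  + \lambda \sum_{y, z} \mu_{yz} \big(\gamma(y) - \gamma(z)\big)^2, \qquad (7)$$

where $0 \le \lambda \le 1$ defines the tradeoff and $\mu_y$ weights
the importance of the node $v_y$. Interestingly, the solution to
Eq. (7) is simply the convolution of the original signal with a
specific kernel², i.e., $\gamma = f * \ell$.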

¹ Including linearity, shift-invariance, semi-group structure,
non-enhancement of local extrema (i.e., monotonicity),
scale-invariance, etc.; see (Lindeberg, 1994) for details and proofs.

² This can be proven by the first-order optimality of Eq. (7).
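The claim that graph-based smoothing reduces to a linear filter can be checked numerically. The sketch below assumes a quadratic objective in the spirit of Mei et al. (2008), namely $(1-\lambda)\sum_y \mu_y(\gamma_y - f_y)^2 + \lambda\sum_{y<z}\mu_{yz}(\gamma_y - \gamma_z)^2$; this exact form, the toy graph and the function name are assumptions. Setting the gradient to zero gives one linear equation per word, solved here by Gauss-Seidel sweeps.

```python
def semantic_smooth(f, edges, lam=0.5, mu=None, iters=500):
    """Solve the first-order optimality conditions of the assumed objective.

    edges maps an unordered word pair (y, z) to its weight mu_yz.
    """
    size = len(f)
    mu = mu or [1.0] * size
    nbrs = [[] for _ in range(size)]
    for (y, z), w in edges.items():
        nbrs[y].append((z, w))
        nbrs[z].append((y, w))
    g = list(f)
    for _ in range(iters):
        for y in range(size):
            # Stationarity in gamma_y:
            # (1-lam)*mu_y*(g_y - f_y) + lam*sum_z mu_yz*(g_y - g_z) = 0
            num = (1 - lam) * mu[y] * f[y] + lam * sum(w * g[z] for z, w in nbrs[y])
            den = (1 - lam) * mu[y] + lam * sum(w for _, w in nbrs[y])
            g[y] = num / den
    return g

edges = {(0, 1): 1.0, (1, 2): 0.5}  # hypothetical graph on 4 words; word 3 is isolated
f = [1.0, 0.0, 0.0, 0.0]
gamma = semantic_smooth(f, edges)
```

Because the optimality conditions are linear in `f`, smoothing a sum of signals equals the sum of the smoothed signals, which is what allows the solution to be written as the original signal filtered by a fixed kernel.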



Compared with spatial filtering, semantic filtering is, however, more
challenging. In particular, the semantic domain is heterogeneous and
not shift-invariant: the degree of correlation $\mu_{yz}$ depends on
both coordinates $y$ and $z$ rather than on their difference
$(y - z)$. As a result, kernels that provably satisfy the scale-space
axioms are no longer feasible. We therefore set aside these
requirements and define kernels in terms of the dissimilarity $d_{yz}$
between a pair of words $y$ and $z$ rather than their direct
difference $(y - z)$, that is, $\ell_y(y, z; s) = \ell_x(d_{yz}, s)$,
where we use $\ell_y$ to denote the semantic kernel to distinguish it
from the spatial kernel $\ell_x$. For the Gaussian, this means

$$\ell_y(y, z; s) = \frac{1}{\sqrt{2\pi s}} \exp\left(-d_{yz}^2 / 2s\right).$$
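A dissimilarity-based semantic kernel can be sketched as follows. The dissimilarity matrix is a made-up example, and the row normalization, under which each word shares out its own mass among related words, is an assumption added so that the result remains a distribution.

```python
import math

def semantic_gaussian(d, s):
    """K[y][z] = exp(-d[y][z]^2 / (2 s)) / sqrt(2 pi s) for a dissimilarity matrix d."""
    c = 1.0 / math.sqrt(2.0 * math.pi * s)
    return [[c * math.exp(-(dyz ** 2) / (2.0 * s)) for dyz in row] for row in d]

def semantic_filter(f, kernel):
    """gamma(y) = sum_z f(z) * K[z][y] / sum_y' K[z][y']."""
    size = len(f)
    gamma = [0.0] * size
    for z in range(size):
        row_sum = sum(kernel[z])
        for y in range(size):
            gamma[y] += f[z] * kernel[z][y] / row_sum
    return gamma

# Hypothetical dissimilarities: words 0 and 1 are close, word 2 is far from both.
d = [[0, 1, 3],
     [1, 0, 3],
     [3, 3, 0]]
f = [1.0, 0.0, 0.0]
gamma = semantic_filter(f, semantic_gaussian(d, 1.0))
```

Since the kernel is indexed by word pairs through `d` rather than by index differences, relabeling the vocabulary only permutes the output, in line with the permutation invariance noted above.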



3.5 Text Scale Space 

Scale is vital for the understanding of natural lan- 
guage, yet it is nontrivial to determine which scale is 
appropriate for a specific task at hand in advance. As 
a matter of fact, the best choice usually varies from 
task to task and from document to document. Even 
within one document, it could be heterogeneous, 
varying from paragraph to paragraph and sentence 
to sentence. For the purpose of automatic modeling, 
there is no way to decide a priori which scale fits 
the best. More importantly, it might be impossible 
to capture all the right meanings at a single scale. 
Therefore, the only reasonable way is to simulta- 
neously represent the document at multiple scales, 
which is exactly the notion of scale space. 

The scale-space representation embeds a textual signal into a
continuous scale space, i.e., represents it by a family of
progressively smoothed signals parameterized by continuous scale
parameters. In particular, for a 2D textual signal $f(x, y)$, we have:

$$\gamma(x, y; s_x, s_y) = f(x, y) * \ell(x, y; s_x, s_y), \qquad (8)$$

where the 2D smoothing kernel $\ell$ is separable between the spatial
and semantic domains, i.e.,

$$\ell(x, y; s_x, s_y) = \ell_x(x; s_x)\, \ell_y(y; s_y). \qquad (9)$$

Note that we have two continuous scale parameters, the spatial scale
$s_x \in \mathbb{R}_+$ and the semantic scale $s_y \in \mathbb{R}_+$.
The case of 1D signals is even simpler, as they involve only one type
of kernel (spatial or semantic). For a 1D spatial signal $f(x)$, we
have $\ell = \ell_x$, and for a semantic signal $f(y)$,
$\ell = \ell_y$. If $f$ is a vector-variate signal, we simply apply
smoothing to each of its dimensions independently.

Figure 1: Samples from the scale space representation of
the example text "New York Times offers free iPhone 3G
as gifts for new customers in New York" at scales (from
left to right): s = (0, 0), (1, 1), (4, 4), and (64, 64).

Example. As an example, Figure 1 shows four samples,
$\{\gamma(x, y; s = s_i),\ i = 1, 2, 3, 4\}$, from the scale-space
representation $\gamma(x, y; s)$ of a synthetic short text "New York
Times offers free iPhone 3G as gifts for new customers in New York",
where $s = (s_x, s_y)$, the two scales are set equal, $s_x = s_y$, for
ease of explanation, and $\gamma$ is obtained from the word-level 2D
signal. We use a vocabulary containing 12 words (in order): "new",
"york", "time", "free", "iPhone", "gift", "customer", "apple", "egg",
"city", "service" and "coupon", where the last five words are chosen
because of their strong correlations with the words that appear in
this text. The semantic graph is constructed based on pairwise mutual
information scores estimated on the RCV1-v2 corpus as well as a large
set of Web search queries. The (0,0)-scale sample, or the original
signal, is a 12 x 10 binary matrix, recording precisely which word
appears at which position. The smoothed signals at (1,1), (2,2) and
(8,8)-scales, on the other hand, capture not only short-range spatial
correlations such as bi-grams, tri-grams and even higher orders (e.g.,
the named entities "New York" and "New York Times"), but also
long-range semantic dependencies, as they progressively boost the
probability of latent but semantically related topics, e.g.,
"iPhone" → "apple", "customer" → "service", "free" and "gift" →
"coupon", "new" and "iPhone" → "egg" (due to the online electronics
store newegg.com).
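The separable filtering of Eqs. (8) and (9) amounts to smoothing along positions and then along words. The following is a sketch under stated assumptions: a four-word toy vocabulary, a hand-made dissimilarity matrix instead of mutual-information scores, and kernels normalized so that every position and every word shares out exactly its own mass.

```python
import math

def _normalized_rows(rows):
    """Scale every row so that it sums to 1."""
    out = []
    for row in rows:
        z = sum(row)
        out.append([w / z for w in row])
    return out

def text_scale_space(f, d, sx, sy):
    """Smooth an N x M text signal f at spatial scale sx and semantic scale sy.

    d is an M x M matrix of word dissimilarities (assumed given).
    A scale of 0 leaves the corresponding axis untouched.
    """
    n_pos, n_words = len(f), len(f[0])
    g = [list(map(float, row)) for row in f]
    if sx > 0:
        # kx[u][x]: share of the mass at position u that moves to position x.
        kx = _normalized_rows([[math.exp(-((x - u) ** 2) / (2.0 * sx))
                                for x in range(n_pos)] for u in range(n_pos)])
        g = [[sum(kx[u][x] * g[u][y] for u in range(n_pos)) for y in range(n_words)]
             for x in range(n_pos)]
    if sy > 0:
        # ky[z][y]: share of the mass of word z that moves to word y.
        ky = _normalized_rows([[math.exp(-(d[z][y] ** 2) / (2.0 * sy))
                                for y in range(n_words)] for z in range(n_words)])
        g = [[sum(g[x][z] * ky[z][y] for z in range(n_words)) for y in range(n_words)]
             for x in range(n_pos)]
    return g

vocab = ["new", "york", "time", "city"]
f = [[1, 0, 0, 0],   # "new"
     [0, 1, 0, 0],   # "york"
     [0, 0, 1, 0]]   # "time"
# Hypothetical dissimilarities: "city" is close to "york" and far from the rest.
d = [[0, 2, 3, 3],
     [2, 0, 3, 1],
     [3, 3, 0, 3],
     [3, 1, 3, 0]]
samples = {s: text_scale_space(f, d, s, s) for s in (0, 1, 4)}
```

At scale (1, 1) the unobserved word "city" already receives probability mass at the position of "york", the same kind of latent-topic boosting described in the example above.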

4 Scale Space Applications 

The scale-space representation creates a new dimension for text
analysis. Besides providing a multi-scale representation that allows
texts to be analyzed in a scale-invariant fashion, it also enables
well-established computer vision tools to be adapted and applied to
analyzing texts. The scale-space model can be used in NLP and text
mining in a variety of ways. To stimulate further research in this
direction, we present several instantiations.

4.1 Scale-Invariant Text Classification 

In this section, we show how to make text classification
scale-invariant by exploring the notion of a scale-invariant text
kernel (SITK). Given a pair of documents, $d$ and $d'$, at any fixed
scale $s$, the representation $\gamma$ induces a single-scale kernel
$k_s(d, d') = \langle \gamma_s, \gamma'_s \rangle$, where
$\langle \cdot, \cdot \rangle$ denotes any inner product (e.g.,
Frobenius product, Gaussian RBF similarity, Jensen-Shannon
divergence). This kernel can be made scale-invariant via the
expectation:

$$k(d, d') = \mathbb{E}_q[k_s(d, d')] = \int_{\mathbb{R}_+} k_s(d, d')\, q(s)\, ds, \qquad (10)$$

where $q$ is a probability density over the scale space $\mathbb{R}_+$
with $0 \le q(s) \le 1$ and $\int q(s)\, ds = 1$, which in essence
characterizes the distribution of the most appropriate scale. $q$ can
be learned from data via an EM procedure, or in a Bayesian framework
if our belief about the scale can be encoded into a prior distribution
$q_0(s)$. As an example, we show below one possible formulation.

Given a training corpus $\mathcal{D} = \{(d_i, y_i)\}_{i=1}^{n}$,
where $d_i$ is a document and $y_i$ its label, our goal in text
classification is to minimize the expected classification error. To
simplify matters, we assume a parametric form for $q$. In particular,
we use the Gamma distribution
$q(s; k, \theta) = \theta^k s^{k-1} e^{-\theta s} / \Gamma(k)$ due to
its flexibility. Moreover, we propose a formulation that eliminates
the dependence on the choice of the classifier and approximately
minimizes the Bayes error rate (Yang and Hu, 2008), i.e.:

$$\max_{q} \sum_{i=1}^{n} \mathbb{E}_q[h_i(s)] \qquad (11)$$

where $h_i(s) = \Delta_s(d_i, d_i^m) - \Delta_s(d_i, d_i^h)$ is a
heuristic margin; $d_i^h$, called the "nearest-hit", is the nearest
neighbor of $d_i$ with the same class label, whereas $d_i^m$, the
"nearest-miss", is the nearest neighbor of $d_i$ with a different
label, and the distance
$\Delta_s(d, d') = \sqrt{k_s(d, d) + k_s(d', d') - 2 k_s(d, d')}$.
The above formulation can be solved via an EM procedure.
Alternatively, we can discretize the scale space (preferably in
log-scale), i.e., $\mathcal{S} = \{s_1, \ldots, s_m\}$, and optimize a
discrete distribution $q_j = q(s_j)$ directly from the same
formulation. In particular, if we regularize the $\ell_2$-norm of $q$,
Eq. (11) becomes a convex optimization problem with a closed-form
solution that is extremely efficient to obtain:

$$\mathbf{q} = (\bar{\mathbf{h}})_+ / \|(\bar{\mathbf{h}})_+\| \qquad (12)$$

where $\mathbf{q} = [q_1, \ldots, q_m]^\top$, the average margin
vector $\bar{\mathbf{h}} = [\bar{h}_1, \ldots, \bar{h}_m]^\top$ has
entries $\bar{h}_j = \frac{1}{n} \sum_{i=1}^{n} h_i(s_j)$, and
$(\cdot)_+$ denotes the positive-part operator.
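The discretized closed-form solution is simple enough to sketch directly. The margin values below are made up for illustration, and the function names are hypothetical; the error raised when no scale has a positive average margin is a choice of this sketch, since the text does not discuss that case.

```python
import math

def closed_form_scale_weights(margins):
    """margins[i][j] = h_i(s_j). Returns q = (h_bar)_+ / ||(h_bar)_+||_2."""
    n, m = len(margins), len(margins[0])
    h_bar = [sum(row[j] for row in margins) / n for j in range(m)]
    pos = [max(h, 0.0) for h in h_bar]
    norm = math.sqrt(sum(p * p for p in pos))
    if norm == 0.0:
        raise ValueError("no scale has a positive average margin")
    return [p / norm for p in pos]

def scale_invariant_kernel(single_scale_values, q):
    """k(d, d') = sum_j q_j * k_{s_j}(d, d') over the discretized scales."""
    return sum(qj * kj for qj, kj in zip(q, single_scale_values))

# Two training documents, three candidate scales (illustrative numbers only).
margins = [[0.2, -0.1, 0.4],
           [0.4, -0.3, 0.2]]
q = closed_form_scale_weights(margins)
k = scale_invariant_kernel([1.0, 5.0, 3.0], q)
```

Scales whose average margin is negative receive zero weight, so the middle scale drops out of the combined kernel here. Note that `q` has unit Euclidean norm rather than unit sum, as in Eq. (12).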

Experiments. We test the scale-invariant text kernels (SITK) on the
RCV1-v2 corpus, focusing on the 161,311 documents from ten leaf-node
topics: C11, C24, C42, E211, E512, GJOB, GPRO, M12, M131 and M142.
Stop words are removed from each text, which is then stemmed. The top
20K words with the highest DFs (document frequencies) are selected as
the vocabulary; all other words are discarded. The semantic network is
constructed based on pairwise mutual information scores estimated on
the whole RCV1 corpus as well as a large-scale repository of web
search queries, and further sparsified with a cut-off threshold. We
implemented the sentence-level 2D, the LDA 1D and the BOW 1D signals
for this task. For the first two, the documents are normalized to the
length of the longest one in the corpus via bi-linear interpolation.

We examined the classification performance of SVM classifiers trained
on the one-vs-all splits of the training data, where three types of
kernels (i.e., linear (Frobenius), RBF Gaussian and Jensen-Shannon
kernels) were considered. The average test accuracy (i.e.,
micro-averaged F1) scores are reported in Table 1. As a reference, the
results of BOW representations with TF or TFIDF attributes are also
included. For all three kernel options, the scale-space based SITK
models significantly (according to a t-test at the 0.01 level)
outperform the two BOW baselines, while the sentence-level SITK
performs best by a substantial margin, with up to 7.8% accuracy
improvement (i.e., 56% error reduction).

Table 1: Text classification test accuracy. We compared
five models: the bag-of-words vector space models with
TF or TFIDF attributes, and the scale-invariant text
kernels with BOW 1D (SITK.BOW), LDA 1D (SITK.LDA)
and sentence-level 2D (SITK.Sentence) textual signals.
The best result in each column is obtained by SITK.Sentence.

Model\Kernel     Linear    RBF       J-S
TF               0.8789    0.9087    0.8901
TFIDF            0.8821    0.9099    0.9016
SITK.BOW         0.8917    0.9143    0.9076
SITK.LDA         0.9284    0.9312    0.9239
SITK.Sentence    0.9473    0.9525    0.9496

4.2 Scale-Invariant Document Retrieval

Capturing users' information needs from their input queries is
crucially important to information retrieval, yet notoriously
challenging because the information conveyed by a short query is far
more vague and subtle than a BOW model can capture. It is therefore
desirable to base search on more effective text representations than
BOW. We show here how the scale-space model, together with interest
point detection techniques, can be used to make a retrieval model
scale-invariant and more effective.

Given a set of documents $\{d\}$ and a query $Q$, our goal is to rank
the documents according to their relevance w.r.t. $Q$. The key to text
retrieval is a relevance model $r(Q, d)$. We define $r$ in the same
spirit as we developed the SITK. In particular, if we normalize the
representations of $Q$ and $d$ to the same dimension, e.g., via
bi-linear interpolation³, then at any fixed scale $s$, the scale-space
model induces a relevance function
$r(Q, d \mid s) = \langle \gamma_Q, \gamma_d \rangle$ (e.g., via
KL-divergence or Jensen-Shannon score). This relevance model can be
made scale-invariant by defining a distribution $q$ over the scale
space and using:

$$r(Q, d) = \mathbb{E}_q[r(Q, d \mid s)], \qquad (13)$$

which is referred to as the scale-invariant language model (SILM). As
in §4.1, $q$ can be learned through a Bayesian framework or via an EM
procedure. As an example, assume $q$ is again a Gamma distribution
with parameters $(k, \theta)$. Moreover, assume we have a training
corpus containing a set of queries $\{Q\}$, and for each $Q$ a set of
documents $\{d^Q\}$ along with their relevance judgements. We have the
following pairwise preference learning formulation:

$$\max_{q} \sum_{Q} \sum_{d_i^Q \succ d_j^Q} \mathbb{E}_q[h(Q, i, j \mid s)] \qquad (14)$$

³ In the case of the sentence-level 2D or LDA 1D signals, $\gamma_Q$
is a vector and $\gamma_d$ is a matrix; this simply amounts to
replicating $\gamma_Q$ to the same dimension as $\gamma_d$, or
equivalently applying a sentence-level sliding window to $d$,
calculating $r$ at each point and summing the relevance scores.



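where the pairwise margin
$h(Q, i, j \mid s) = r(Q, d_i^Q \mid s) - r(Q, d_j^Q \mid s)$, and
$d_i^Q \succ d_j^Q$ means $d_i^Q$ is more relevant to $Q$ than
$d_j^Q$. This formulation can be solved efficiently via a similar EM
procedure, and again, in the discrete case with
$\ell_2$-regularization, has an efficient closed-form solution:

$$\mathbf{q} = (\bar{\mathbf{h}})_+ / \|(\bar{\mathbf{h}})_+\| \qquad (15)$$

where the average margin
$\bar{\mathbf{h}} = [\bar{h}_1, \ldots, \bar{h}_m]^\top$ with
$\bar{h}_l = \sum_{Q} \sum_{d_i^Q \succ d_j^Q} h(Q, i, j \mid s_l)$,
$l = 1, \ldots, m$.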
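The retrieval counterpart can be sketched in the same way for a single query. The per-scale relevance scores and the preference pair are invented for illustration, and the uniform fallback used when no scale has a positive margin is an assumption of this sketch.

```python
import math

def learn_scale_weights(scores, preferences):
    """scores[d][l] = r(Q, d | s_l); preferences lists (better, worse) document pairs."""
    m = len(next(iter(scores.values())))
    h = [sum(scores[a][l] - scores[b][l] for a, b in preferences) for l in range(m)]
    pos = [max(v, 0.0) for v in h]
    norm = math.sqrt(sum(v * v for v in pos))
    if norm == 0.0:
        return [1.0 / m] * m  # assumption: fall back to uniform weights
    return [v / norm for v in pos]

def rank(scores, q):
    """Order documents by r(Q, d) = sum_l q_l * r(Q, d | s_l), best first."""
    r = {doc: sum(ql * rl for ql, rl in zip(q, rs)) for doc, rs in scores.items()}
    return sorted(r, key=r.get, reverse=True)

# Relevance of three documents at two scales (illustrative numbers only).
scores = {"d1": [0.9, 0.2], "d2": [0.1, 0.8], "d3": [0.5, 0.5]}
q = learn_scale_weights(scores, [("d1", "d2")])  # d1 is judged more relevant than d2
ranking = rank(scores, q)
```

Only the first scale agrees with the stated preference, so it receives all the weight and the final ranking follows the scores at that scale.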

More interestingly, the scale-space model can also be used, together
with techniques for interest point detection (Lowe, 2004), to address
passage retrieval (PR) in a scale-invariant manner, i.e., to determine
not only which documents are relevant but also which parts of them are
relevant. PR is particularly advantageous when documents are
substantially longer than queries or when they span a large variety of
topic areas, for example, when retrieving books. A key challenge in PR
is how to effectively narrow our attention to a small part of a long
document. Existing approaches mostly employ a sliding-window style
exhaustive search, i.e., they scan through every possible passage,
compute relevance scores and rank all of them (Tellex et al., 2003).
These approaches are computationally inefficient, since the number of
possible passages can be quite large for long documents. Here we
propose a new idea which employs interest point detection (IPD)
algorithms to quickly focus our attention on a small set of
potentially relevant passages. In particular, for a given $(Q, d)$
pair, we first apply IPD (without normalization) to both $\gamma_Q$
and $\gamma_d$ in scale space, then match them locally between region
pairs centered at each interest point and calculate the relevance
scores there.

Experiments. We evaluated SILM on a text retrieval task based on the
OHSUMED data set, a collection of 348,566 documents, 106 queries and
16,140 relevance judgements. Preprocessing steps similar to those in
§4.1 were applied. For SILM, the standard Kullback-Leibler divergence
was used as the relevance function. For comparison, the unigram
language model (1-LM) was used as the baseline. The results are
reported in Table 2 in terms of three standard IR evaluation measures,
i.e., Mean Average Precision (MAP) and Precision at N with N = 5 and
10 (P@5 and P@10). We observe that the SILM models outperform the
unigram LM by up to 15% in MAP, 13% in P@5 and 10% in P@10. All these
improvements are significant according to a Wilcoxon test at the 0.01
level. Again, the best performance is obtained by the sentence-level
2D based SILM model.

Table 2: Text retrieval performance. We evaluate four
models: the Unigram Language Model (1-LM) and the
Scale-Invariant Language Models with three textual
signal options (referred to as SILM.BOW, SILM.LDA and
SILM.Sentence respectively).

Model\Measure    MAP       P@5       P@10
1-LM             0.2699    0.4812    0.4659
SILM.BOW         0.2807    0.5076    0.4762
SILM.LDA         0.2839    0.5154    0.4981
SILM.Sentence    0.3099    0.5447    0.5108

4.3 Hierarchical Document Keywording 

The extrema (i.e., maxima and minima) of a signal and its first few
derivatives contain important information for describing the structure
of the signal, e.g., patches of significance, boundaries, corners,
ridges and blobs in an image. The scale-space model provides a
convenient framework to obtain the extrema of a signal at different
scales. In particular, the extrema in the $(k-1)$-th derivative of a
signal are given by the zero-crossings in the $k$-th derivative, which
can be obtained at any scale in the scale space conveniently via the
convolution of the original signal with the derivative of the Gaussian
kernel, i.e.:

$$\frac{\partial^k \gamma}{\partial x^k} = f * \frac{\partial^k \ell}{\partial x^k}.$$

Since the Gaussian kernel is infinitely differentiable, the
scale-space model makes it possible to obtain local
extrema/derivatives of a signal to arbitrary orders even when the
signal itself is not differentiable. Moreover, due to the
"non-enhancement of local extrema" property, local extrema are created
monotonically as we decrease the scale parameter $s$. In this section,
we show how this can be used to detect keywords from a document in a
hierarchical fashion. The idea is to work with the word-level 2D
signal (other options are also possible) and track the extrema (i.e.,
patterns of significance) of the scale-space model $\gamma$ through
the zero-crossings of its first derivative $\gamma'$ to see how
extrema progressively emerge as the scale $s$ goes from coarse to
finer levels. This process reduces the scale-space representation to a
simple ternary tree in the scale space, i.e., the so-called "interval
tree" in (Witkin, 1983). Since $f$ defines a probability over the
spatial-semantic space, it is straightforward to interpret the
identified intervals as keywords. This algorithm therefore yields a
keyword tree that defines the topics we can perceive at different
levels of granularity from the document.
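The coarse-to-fine emergence of extrema can be sketched on a 1D toy signal. This is not the paper's keywording algorithm: it only locates maxima at two fixed scales through sign changes of the signal convolved with the first derivative of a Gaussian, using zero padding outside the signal, and it does not build the interval tree.

```python
import math

def gaussian_derivative(u, s):
    """First derivative of the Gaussian kernel with variance s, evaluated at u."""
    return -u / s * math.exp(-(u * u) / (2.0 * s)) / math.sqrt(2.0 * math.pi * s)

def smoothed_derivative(signal, s):
    """gamma'(x) = (f * dl/dx)(x), with zero padding outside the signal."""
    n = len(signal)
    return [sum(signal[u] * gaussian_derivative(x - u, s) for u in range(n))
            for x in range(n)]

def maxima_from_zero_crossings(signal, s):
    """Positions where gamma' changes sign from + to -, i.e. local maxima of gamma."""
    g1 = smoothed_derivative(signal, s)
    hits = []
    for x in range(len(g1) - 1):
        if g1[x] > 0 >= g1[x + 1]:
            # Report whichever neighbour is closer to the actual zero.
            hits.append(x if abs(g1[x]) < abs(g1[x + 1]) else x + 1)
    return hits

f = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0]  # three bumps of different strength
coarse = maxima_from_zero_crossings(f, 16.0)
fine = maxima_from_zero_crossings(f, 0.25)
```

At the coarse scale the three bumps merge into a single maximum, and at the fine scale each bump is recovered at its own position; recording at which scale each maximum first appears would give the hierarchy described above.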

Experiments. As an illustrative example, we apply the hierarchical
keywording algorithm described above to the current paper. The
keywords that emerged, in order, are as follows: "scale space" →
"kernel", "signal", "text" → "smoothing", "spatial", "semantic",
"domains", "Gaussian", "filter", "text analysis", "natural language",
"word" → ....

4.4 Hierarchical Text Segmentation 

In the previous section, we show how semantic key- 
words can be extracted from a text in a hierarchi- 
cal way by tracking the extrema of its scale space 
model 7. In the same spirit, here we show how topic 
boundaries in a text can be identified by tracking the 
extrema of the first derivative 7'. 

Text segmentation is an important topic in NLP and has been
extensively investigated previously (Beeferman et al., 1999). Many
existing approaches, however, are only able to identify a flat
structure, i.e., all the boundaries are identified at a single level.
A more challenging task is to automatically identify a hierarchical
table-of-contents style structure for a text, that is, to organize the
boundaries of different text units in a tree structure according to
their topic granularities, e.g., chapter boundaries at the top level,
followed in order by boundaries of sections, subsections, paragraphs
and sentences as the depth increases. This can be achieved
conveniently by the interval tree and coarse-to-fine tracking idea
presented in (Witkin, 1983). In particular, if we keep tracking the
extrema of the 1st-order derivative (i.e., the rate of change) by
looking at the points satisfying:

$$\frac{\partial^2}{\partial x^2} \gamma = 0, \quad \text{while} \quad \frac{\partial^3}{\partial x^3} \gamma \neq 0.$$

Figure 2: Hierarchical text segmentation in scale space.

Due to the monotonic nature of the scale-space representation, such
contours are closed above but open below in the scale space. They
naturally illustrate how topic boundaries appear progressively as the
scale $s$ becomes finer. The exact localization of a boundary can be
obtained by tracking back to the scale $s = 0$. Also note that this
algorithm, unlike many existing ones, does not require any supervision
information.
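A single-scale version of this boundary detection can be sketched as follows. It assumes a vector-valued sentence signal (here a two-topic toy example), replaces the derivative by a finite difference between neighbouring sentences, and takes interior local maxima of the velocity magnitude as boundaries; tracking these across scales down to $s = 0$ is not shown.

```python
import math

def smooth_rows(signal, s):
    """Gaussian-smooth an N x K signal along the position axis (weights renormalized)."""
    n = len(signal)
    out = []
    for x in range(n):
        w = [math.exp(-((x - u) ** 2) / (2.0 * s)) for u in range(n)]
        z = sum(w)
        out.append([sum(w[u] * signal[u][k] for u in range(n)) / z
                    for k in range(len(signal[0]))])
    return out

def velocity(signal, s):
    """||gamma(x+1) - gamma(x)||_2, a finite-difference stand-in for ||d gamma / dx||_2."""
    g = smooth_rows(signal, s)
    return [math.sqrt(sum((b - a) ** 2 for a, b in zip(g[x], g[x + 1])))
            for x in range(len(g) - 1)]

def boundaries(signal, s):
    """Gaps between sentences x and x+1 where the velocity has an interior local maximum."""
    v = velocity(signal, s)
    return [x for x in range(1, len(v) - 1) if v[x - 1] < v[x] >= v[x + 1]]

# Eight sentences: the first four about topic A, the last four about topic B.
signal = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
cuts = boundaries(signal, 1.0)
```

The only boundary found lies between the fourth and fifth sentences, where the smoothed topic vector changes fastest, and no labeled data is involved at any point.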

Experiments. As an example, we apply the hierarchical segmentation
algorithm to the current paper. We use the sentence-level 2D signal.
Let $\gamma(x, s_x)$ denote the vector
$\gamma(x, \cdot, s_x, s_y = C)$, where the semantic scale $s_y$ is
fixed to a constant $C$, and the semantic index $y$ enumerates the
whole vocabulary $\{y = v_1, \ldots, v_M\}$. We identify hierarchical
boundaries by tracking the zero contours
$\|\frac{\partial^2}{\partial x^2} \gamma(x, s_x)\|_2 = 0$ (where
$\|\cdot\|_2$ denotes the $\ell_2$-norm) to the scale $s = 0$, where
the length of the projection in scale space (i.e., the vertical span)
reflects each contour line's topic granularity, as plotted in Figure 2
(top). As a reference, the velocity magnitude curve (bottom)
$\|\frac{\partial}{\partial x} \gamma(x, s_x)\|_2$, and the true
boundaries of sections (red-dashed vertical lines in the top figure)
and subsections (green-dashed) are also plotted. As we can see, the
predictions match the ground truth with satisfactorily high accuracy.

5 Summary 

This paper presented a scale-space theory for text, adapting concepts,
formulations and algorithms that were originally developed for images
to address the unique properties of natural language texts. We also
showed how scale-space models can be utilized to facilitate a variety
of NLP tasks. There are many promising topics along this line, for
example, algorithms that scale up the scale-space implementations
towards massive corpora, structures of the semantic networks that
enable efficient or even closed-form scale-space kernels/relevance
models, and effective scale-invariant descriptors (e.g., named
entities, topics, semantic trends in text) for texts, similar to the
SIFT feature for images (Lowe, 2004).



[Zhai and Lafferty2004] C. Zhai and J. Lafferty. 2004. A
study of smoothing methods for language models applied
to information retrieval. ACM TOIS, 22(2):179-